Comparison of methods for predicting COVID-19-related death in the general population using the OpenSAFELY platform

Background Obtaining accurate estimates of the risk of COVID-19-related death in the general population is challenging in the context of changing levels of circulating infection. Methods We propose a modelling approach to predict 28-day COVID-19-related death which explicitly accounts for COVID-19 infection prevalence using a series of sub-studies from new landmark times incorporating time-updating proxy measures of COVID-19 infection prevalence. This was compared with an approach ignoring infection prevalence. The target population was adults registered at a general practice in England in March 2020. The outcome was 28-day COVID-19-related death. Predictors included demographic characteristics and comorbidities. Three proxies of local infection prevalence were used: model-based estimates, rate of COVID-19-related attendances in emergency care, and rate of suspected COVID-19 cases in primary care. We used data within the TPP SystmOne electronic health record system linked to Office for National Statistics mortality data, using the OpenSAFELY platform, working on behalf of NHS England. Prediction models were developed in case-cohort samples with a 100-day follow-up. Validation was undertaken in 28-day cohorts from the target population. We considered predictive performance (discrimination and calibration) in geographical and temporal subsets of data not used in developing the risk prediction models. Simple models were contrasted to models including a full range of predictors. Results Prediction models were developed on 11,972,947 individuals, of whom 7999 experienced COVID-19-related death. All models discriminated well between individuals who did and did not experience the outcome, including simple models adjusting only for basic demographics and number of comorbidities: C-statistics 0.92–0.94. However, absolute risk estimates were substantially miscalibrated when infection prevalence was not explicitly modelled. Conclusions Our proposed models allow absolute risk estimation in the context of changing infection prevalence but predictive performance is sensitive to the proxy for infection prevalence. Simple models can provide excellent discrimination and may simplify implementation of risk prediction tools. Supplementary Information The online version contains supplementary material available at 10.1186/s41512-022-00120-2.


Introduction
Characterised as a pandemic by the World Health Organization on 11 March 2020 [1], globally, the cumulative number of cases of COVID-19 has exceeded 180 million with almost 4 million deaths attributed to the virus at the time of writing [2]. Evolving policies such as return-to-work strategies and restrictions on social contact are heavily informed by the estimated risk of severe outcomes from COVID-19. Policy-making is often informed by absolute risks in the general population. This risk reflects the result of two processes: being infected and dying once infected, thus depends critically on infection prevalence. Transporting estimates of absolute risk from one context to another, such as a different time period or geographical region, is particularly challenging in COVID-19 due to substantial variation in the infection prevalence over time and by geography [3].
Prediction models that do not explicitly model the underlying infection prevalence may provide inaccurate estimates of absolute risk. Whether such models produce transportable rankings of risk, to different temporal and geographical contexts, is uncertain. Some predictors identified in these models, such as geographical region and characteristics that differ substantially by region, may be indirectly capturing differences in infection prevalence; models explicitly including the infection prevalence may allow simpler predictive models with equally good performance.
Incorporating the underlying infection prevalence is not straightforward because direct estimates are typically not available within small geographical areas. Whether easily accessible proxy measures are sufficiently accurate to produce reasonable risk estimates remains uncertain.
In this study, we compare modelling strategies to predict 28-day COVID-19 death which do and do not allow for dynamic predictions and illustrate the impact that these methodological choices have, in terms of model discrimination and calibration, using data from the first wave of COVID-19 in England held in the OpenSAFELY platform [4] on almost 12 million adults in England. Our overarching goal is to inform subsequent development of risk prediction models in this context, rather than to develop a risk prediction model to inform policymaking. Specifically, we aim to answer the following: (1) Do risk prediction models which do not explicitly model the underlying infection prevalence perform well in different geographical and temporal contexts? (2) Can transportable estimates of absolute risk of COVID-19 death be obtained by explicitly incorporating proxy estimates of the changing infection prevalence? (3) Can simpler prediction algorithms be used for predicting risk without losing substantial predictive ability, compared with more richly specified models?

Methods
Our statistical approach is based on a model for 28-day COVID-19-related death detailed in the Supplementary Appendix. Our reporting adheres to the TRIPOD statement for reporting of multivariable prediction models [5].

Study population
The target population is adults in England living in the community; residential settings are excluded since risks experienced in institutions such as care homes are likely to be very different to those in smaller households.

Data source
Primary care records managed by the GP software provider TPP were linked to Office for National Statistics (ONS) death data through OpenSAFELY, a data analytics platform created by our team on behalf of NHS England to address urgent COVID-19 research questions (https://opensafely.org). OpenSAFELY provides a secure software interface allowing the analysis of pseudonymised primary care patient records from England in near real-time within the electronic health record vendor's highly secure data centre, avoiding the need for large volumes of potentially disclosive pseudonymised patient data to be transferred off-site. This, in addition to other technical and organisational controls, minimises any risk of re-identification. Similarly, pseudonymised datasets from other data providers are securely provided to the electronic health record vendor and linked to the primary care data. Data includes pseudonymised data such as coded diagnoses, medications, and physiological parameters. No free text data are included.

Base cohort
A base cohort was defined, comprising males and females aged 18 years or older registered as of 1 March 2020 in a general practice employing the TPP system, followed up for 100 days (1 March 2020 until 8 June 2020). Individuals with missing age, sex, postcode, ethnicity, a recorded age over 105 years, or living in households of >10 people were excluded.

Predictor variables
We selected candidate predictors based on known or plausible associations with exposure to COVID-19 infection, risk of severe illness or respiratory tract infection, and factors associated with healthcare access or level of care. Potential predictors were age, sex, ethnicity, deprivation, number in household, presence of young children in household, a rural indicator, obesity, smoking, blood pressure, and comorbidities (details in Supplementary Appendix).
Three different proxy measures of COVID-19 infection prevalence, measured daily, were available: (1) model-based estimates [6] available by region and age group, (2) rate of COVID-19-related A&E (emergency) attendances over the last week by local geographic area, and (3) rate of suspected COVID-19 cases over the last week by local geographic area (details in Supplementary Appendix).

Development of risk prediction algorithms
Our goal is to predict the risk of 28-day COVID-19 death. Note that this is not death within 28 days of infection, but the risk of experiencing COVID-19 death within a 28-day period for individuals in our target population. Two approaches were adopted: approach A-models which do not explicitly account for the timechanging prevalence of COVID-19 infection, and approach B-landmarking models [7], which use a series of sequential overlapping 28-day sub-studies incorporating time-updating proxy measures of infection prevalence. Risk prediction models were developed using data from the base cohort. Case-cohort sampling was required to enable model fitting for approach B (the stacked substudies contain nearly 900 million rows of data). To reduce computational burden and increase comparability between approaches, case-cohort sampling was used to fit models for both approaches.
To develop models within approach A, follow-up began 1 March 2020 and ended at the first of COVID-19-related death or 8 June 2020. The outcome was COVID-19-related death (any time during the 100-day follow-up). No censoring was applied at death due to non-COVID causes, to target the sub-distribution hazard [8]; no other censoring events occurred. The analysis sample included all cases of COVID-19-related death and a random age-stratified sample of the base cohort (the 'sub-cohort') [9,10], with sampling fractions of 0.01 in the age group 18-<40, 0.02 in 40-<60, 0.025 in 60-< 70, 0.05 in 70-<80 and 0.13 in 80+ years.
Four approach A models were fitted, using different sets of predictor variables. The first used penalised regression (lasso) [11] to select a parsimonious model (the "selected" model; models were subsequently re-fitted including only the selected variables). The second included age group (10-category), sex and their interaction ("agesex" model). The third additionally included grouped number of comorbidities (0, 1, 2, 3+), ethnicity and rural/urban ("comorbidities" model). The fourth included all 36 potential predictors and all possible interactions with age and sex ("full" model, details in the Supplementary Appendix).
Cox proportional hazards model were fitted including the relevant predictors using time in study as the timescale with Barlow weights to account for the case-cohort design and robust standard errors [9,10]. The baseline survivor function was estimated at day 28 (Ŝ 28 ) and the estimated log hazard ratios (β) were extracted. For an individual with predictor values x i , the predicted risk is then given by [12]: To develop models within approach B, using landmarking models [7], a series of 73 overlapping sequential sub-studies were extracted from the base cohort. The sub-studies started 0, 1, 2, 3, 4…, 72 days after 1 March 2020 (Fig. 1). Follow-up started at the sub-study start date and ended at the first of COVID-19-related death or 28 days after sub-study entry. Individuals were not censored at deaths due to other causes. The outcome was COVID-19-related death during the 28-day period. Each sub-study had a case-cohort design, including all eligible individuals (those in the base cohort still alive at the sub-study start) who experienced a COVID-19related death during the sub-study period and an agestratified random sample of sub-study eligible individuals, with age group-specific sampling fractions equal to 1/70 of the sampling fractions for approach A. Data from all sub-studies were combined for analysis. Predictor variables were assessed at day 0 of each substudy.
Functional forms for the proxy measures of COVID-19 infection prevalence were selected using Akaike's Information Criterion, with our selection of candidate functional forms guided by our proposed model for COVID-19 death (Supplementary Appendix).
A Poisson model for 28-day COVID-19-related death was fitted to the combined dataset using Barlow weights and robust standard errors, incorporating predictorschosen via the lasso for the "selected" model, prespecified for the "age-sex", "comorbidities" and "full" models-and proxy measures of COVID-19 infection prevalence. In each case, 3 models were fitted: one for each proxy measure.
The predicted risk is given by the same equation as for the Cox model, replacing the hazard ratios by incidence rate ratios and obtaining the baseline survivor function from the estimated constant coefficient.

Model validation
Calibration and discrimination were assessed for 17 risk prediction algorithms (5 approach A; 12 approach B). For approach A, this was the "selected", "age-sex", "comorbidities" and "full" models, as well as the existing COVID-AGE risk tool (10th update) [13,14]. For approach B, this was the "selected", "age-sex", "comorbidities" and "full" models incorporating each of the 3 proxy measures of infection prevalence.

Overall internal validation
From the base cohort, three validation cohorts were defined each lasting 28 days. The three validation cohorts covered periods with higher and lower infection prevalence ( Fig. 2), to allow comparison of modelling strategies under these different conditions. Each validation cohort study comprised all individuals from the base cohort who remained alive at the start of the validation period. Predictor variables were assessed at day 0 of the validation cohort; predicted risk of 28-day COVID-19related death was obtained using each of the 17 algorithms. Model performance was assessed by comparing these predicted risks to the observed binary outcome, 28-day COVID-19-related death (death within the 28day period of the validation cohort). Discrimination was assessed by Harrell's C-statistic [15,16]. Calibration was assessed by estimating the calibration intercept and slope and comparing observed and predicted risk, overall and by groups of predicted risk [16,17]. Flexible calibration curves were drawn. Model performance was assessed overall and within sex and broad age groups (18-<70, 70-<80 and 80+; insufficient events occurred in the youngest age group to split further) and regions.
COVID-AGE is primarily a risk stratification tool, so we did not obtain calibration measures for this tool [14].

Geographical and temporal internal validation
Although we were unable to perform external validation, we undertook additional forms of internal validation as recommended by Steyerberg et al. [18]. To assess model performance in different geographical contexts, the 16 risk prediction models developed within these data were re-fitted after excluding all individuals from a particular geographical region [18]. The resulting models were used to predict risk in the subset of the 3 validation cohorts comprising individuals from the excluded geographical region. Model performance was assessed within the excluded region (i.e. validation was performed using individuals whose data was not used in developing the risk prediction models). This process was repeated for each of the 7 regions. To assess model performance in different temporal contexts, the 16 risk prediction models were re-fitted after excluding the last 28 days of data (i.e. the whole time period of validation period 3). The resulting models were used to predict risk in validation cohort 3. Individuals who remained alive at the start of validation cohort 3 appeared in both the model development data and the validation data, but their predictor values were updated at the start of validation period 3; individuals who experienced the event in the development data did not appear in the validation data.

Missing data
Ethnicity, body mass index (BMI) and smoking data are collected in response to clinical need, thus likely to be missing not at random. As smoking and obesity, if present, are likely to be recorded, individuals with missing BMI were assumed non-obese and individuals with no smoking information were assumed non-smokers. Individuals with no serum creatinine measurement were included in the "no evidence of poor kidney function" group. Individuals with diabetes but no glycosylated haemoglobin (HbA1c) measurement were included in a separate "diabetes, no HbA1c" category. Analysis was restricted to individuals with recorded age, sex and ethnicity data; such complete case analysis is valid under a range of missing not at random assumptions [19].

Sensitivity analysis
Additional sensitivity analyses were undertaken (Supplementary Appendix), including refitting the approach A "selected" model in the entire cohort (without casecohort sampling), leading to very similar results, and using a Weibull model for both approaches A and B to increase comparability between approaches, resulting in similar conclusions about the comparison between approaches. Further details of the methods used can be found in our pre-published protocol [20].

Software
Data management was performed using Python and Google BigQuery, with analysis carried out using Stata 16.1/Python. All code used is openly shared online for review and re-use under MIT open license (https:// github.com/opensafely/risk-prediction-research).

Patient and public involvement
We invite any patient or member of the public to contact us regarding this study or the broader OpenSAFELY project through our website https://opensafely.org/.

Results
The base cohort comprised almost 12 million individuals, of whom 7999 experienced a COVID-19-related death (

Discrimination
In all validation periods, the C-statistic for the "selected" models was high (0.92-0.94), indicating excellent ability to distinguish between individuals who experienced 28day COVID-19 death and who did not ( Table 2). No difference in discrimination was seen between approaches A and B.
Among the oldest age group (80+ years), discrimination was much lower than in younger age groups (Table 3), reflecting the substantial discrimination that comes from age which is partly lost after age restriction. For example, for the approach A "selected" model in validation period 1, the C-statistic was 0.77 for females and 0.65 for males among the 80+ age group compared with 0.88 and 0.91 among the 18-<70 age group. Discrimination was similar across regions (Table S4; range of Cstatistics across regions and validation periods: approach A 0.903-0.962, approach B 0.907-0.966, other than in the North West region in validation period 1, which had lower discrimination: approach A C-statistic 0.84; approach B 0.85-0.86).
Geographical internal validation showed that discrimination of models was largely insensitive to the removal of a region (Tables 4 and S7, Figure S9), with the exception of omission of the North-West region in validation period 1, which resulted in lower discrimination for all models. Removing validation period 3 from the development data, in the temporal validation, did not reduce discrimination for validation period 3.

Absolute risk estimation
For the approach A Cox model, the mean predicted and observed risks were very similar in the first validation period (the initial portion of the data on which the  model was fitted), but different in the second and third ( Table 2). In validation period 2, the mean observed risk was ten times higher than in validation period 1, but the mean predicted risk was (by design, since it ignores the infection prevalence) almost identical for all three periods.
For the approach B models incorporating model-based estimates of infection prevalence, the mean and observed risks were very similar in all validation periods. The calibration intercept was slightly less than zero in validation periods 1 and 3, indicating slight over-estimation on average, with calibration slopes close to one. Replacing model-based estimates by either the rate of A&E COVID-19 attendances or the rate of suspected COVID-19 cases in primary care resulted in poorer calibration, particularly in the first validation period which had a very low infection prevalence. All approach B models had worse calibration than the approach A model in  validation period 1 but considerably better calibration than the approach A model in the other two validation periods. Figures S3-S6 show flexible calibration curves and predicted versus observed risks by twentieths of predicted risk. Calibration intercepts varied by region (e.g. approach A in validation period 1, range of estimated calibration intercepts: −1, +1.12), but calibration slopes varied less (Table S4). The intercepts varied least for approach B in validation period 2 (range of estimated calibration intercepts: −0.28, 0.21), which was the period with the highest infection prevalence.
Geographical internal validation showed that estimated calibration intercepts were sensitive to removal of a geographical region, but slopes less so. Using approach A models, the temporal validation results were similar to the overall internal validation findings. In contrast, using approach B, we found worse calibration under the temporal validation (Table S6, Fig. S8).

Models with fewer predictors
A simple model including only age and sex provided reasonable discrimination (C-statistic~0.80) among the 18-<70 age group (Tables 3 and S5). Among older age groups, in contrast, discrimination was substantially lower in the "age-sex" model compared to the models with more predictors.
Among females in the 18-<70 age group, both the "comorbidities" and "full" models had C-statistics of 0.89 in validation period 1. In males, the analogous numbers were 0.92 ("comorbidities") and 0.93 ("full"). Among this age group, across the validation periods in both approaches A and B, discrimination was at most 0.03 lower for the "comorbidities" compared to the "full" model.

Discussion
We have proposed a modelling approach based on landmarking which enables dynamic incorporation of timeand region-dependent information on infection prevalence. We have demonstrated that our modelling approach can provide well-calibrated estimates, with good discrimination, of absolute risk of 28-day COVID-19 death in the general population. In contrast, absolute risk estimates cannot be transported from models which do not explicitly model the infection prevalence, although they did rank individuals well. We demonstrated that the performance of our proposed modelling approach depended critically on the proxy used for infection prevalence. We found that the calibration of our modelling approach worsened when applied to time periods with patterns of infection not seen in the model development data. We did not undertake external validation; performance may worsen in completely new settings. Identifying a readily available proxy of infection prevalence is challenging in practice. Models including model-based estimates provided the best performance, but this may be impractical for most situations due to the complex modelling required to obtain these estimates. Models using proxies more readily available through automated data streams gave slightly inferior performance, albeit considerably better than ignoring the burden of infection entirely. Proxies involving more serious consequences of infection, such as A&E attendance, performed poorly when the underlying infection prevalence was low, since these serious consequences are rare and therefore imprecisely estimated in low prevalence periods. Furthermore, one limitation of this work was that we were unable to explore the level of granularity required in the data used as a proxy for infection prevalence. Where COVID-19 testing is geographically widespread, direct estimates of local infection are likely to provide the best measure. While this was unavailable in the UK at our study start, more recent testing data are much more comprehensive.
Joint modelling offers an alternative approach to obtaining dynamic predictions [21]. We chose a landmarking approach because it is computationally efficient and it is easier to extend to make predictions beyond the temporal subset of data used for model development.
Our underlying theoretical model for COVID-19related death suggested it may be possible to separate the dynamics of the epidemic from the process of risk prediction based on patient characteristics, allowing the infection process to be ignored, provided that models carefully account for geographical region. Surprisingly, we found that our approach A models, which did not account for region, were still able to provide good discrimination which transported well geographically and temporally. This provides reassurance that these potential theoretical biases do not lead to substantial deterioration in practical performance in terms of ranking patients, but region-specific performance suggested accounting for region may be important when good calibration is desired.
We found that the existing COVID-AGE tool [13,14] had very high discrimination, similar to the best performing models considered, suggesting this provides a reliable ranking of COVID-19 mortality risk. Interestingly, very simple models including only age, sex, ethnicity, a rural indicator, and a count of total comorbidities led to models with very good discrimination. When focusing on specific patient groups with higher morbidity, more complex models may provide useful additional discrimination, but in many cases much simpler models are able to discriminate well.
We did not correct any of the models for optimism [22,23], although the similarity of the C-statistics in the overall and geographical internal validations suggests optimism was not a major problem. We compared a number of models that differed in terms of parsimony, ranging from no variable selection to a model containing only age and sex. The regression penalization used in the selected model has the effect of shrinking model coefficients; however, over-optimism can still remain [24]. Electronic health record data are not collected for research, so information on certain characteristics can be incomplete or absent. Our approach to missing data reflected the way in which these models might be used in practice if applied within electronic health record systems. Our measures of model performance reflect performance under this implementation [25]. Differences observed between approaches are unlikely to have been affected by the approach taken to missing data.

Conclusion
The pandemic and our response to it are evolving rapidly. As vaccines and booster programmes roll out, risk prediction models will need to be updated and recalibrated. Our results have a number of implications for researchers developing risk prediction algorithms in COVID-19. First, models that do not explicitly model the infection prevalence can provide good discrimination which transports well geographically and temporally. However, the calibration of such models is poor when applied to different temporal and geographical contexts. When calibration is important, models must explicitly model the infection prevalence. However, performance of these models is very sensitive to the quality of the proxy measure used. Furthermore, our temporal validation suggested that for these models, calibration declines when applied to data which has patterns of infection prevalence not seen in the data used to develop the model, such as occurred in our third validation period. Discrimination, conversely, did not decline. For all models, region-specific calibration was worse than overall calibration, suggesting that there may be regional differences which are important for COVID-19 risk prediction beyond (our measures of) infection levels. While we have focused on COVID-19 death, these findings are relevant for models predicting other severe outcomes of COVID-19 in the general population, such as hospitalisation.
Finally, most of the discriminating power in each model evaluated here was driven by simple features such as age, sex and a count of comorbidities. Complex risk prediction models driven by multiple variables from diverse sources can be difficult, slow and expensive to implement in routine care: our results suggest that the opportunity costs and complexity of such implementations may not be warranted. The finding that simple models produce very high discrimination also suggests that policies targeting population-level reduction of COVID-19 mortality risk may not need to distinguish between all comorbidities in detail. For policy decisions, including ongoing restrictions on social contact and return-to-work strategies, a simple approach to predicting risk incorporating simple eligibility criteria, could accelerate programme rollout and delivery of policies.
The views expressed are those of the authors and not necessarily those of the NIHR, NHS England, Public Health England or the Department of Health and Social Care. Funders had no role in the study design, collection, analysis and interpretation of data; in the writing of the report; and in the decision to submit the article for publication.

Availability of data and materials
Access to the underlying identifiable and potentially re-identifiable pseudonymised electronic health record data is tightly governed by various legislative and regulatory frameworks and restricted by best practice. The data in OpenSAFELY is drawn from General Practice data across England where TPP is the Data Processor. TPP developers (CB, RC, JP, FH, and SH) initiate an automated process to create pseudonymised records in the core OpenSAFELY database, which are copies of key structured data tables in the identifiable records. These are linked onto key external data resources that have also been pseudonymised via SHA-512 one-way hashing of NHS numbers using a shared salt. DataLab developers and PIs (BG, LS, CEM, SB, AJW, WJH, DE, PI, and CTR) holding contracts with NHS England have access to the OpenSAFELY pseudonymised data tables as needed to develop the OpenSAFELY tools. These tools in turn enable researchers with OpenSAFELY Data Access Agreements to write and execute code for data management and data analysis without direct access to the underlying raw pseudonymised patient data and to review the outputs of this code. All code for the full data management pipeline-from raw data to completed results for this analysis-and for the OpenSAFELY platform as a whole is available for review at github.com/OpenSAFELY.

Declarations
Ethics approval and consent to participate NHS England is the data controller; TPP is the data processor; and the key researchers on OpenSAFELY are acting on behalf of NHS England. This implementation of OpenSAFELY is hosted within the TPP environment which is accredited to the ISO 27001 information security standard and is NHS IG Toolkit compliant [18,19]; patient data has been pseudonymised for analysis and linkage using industry standard cryptographic hashing techniques; all pseudonymised datasets transmitted for linkage onto OpenSAFELY are encrypted; access to the platform is via a virtual private network (VPN) connection, restricted to a small group of researchers; the researchers hold contracts with NHS England and only access the platform to initiate database queries and statistical models; all database activity is logged; only aggregate statistical outputs leave the platform environment following best practice for anonymisation of results such as statistical disclosure control for low cell counts [20]. The OpenSAFELY research platform adheres to the obligations of the UK General Data Protection Regulation (GDPR) and the Data Protection Act 2018. In March 2020, the Secretary of State for Health and Social Care used powers under the UK Health Service (Control of Patient Information) Regulations 2002 (COPI) to require organisations to process confidential patient information for the purposes of protecting public health, providing healthcare services to the public and monitoring and managing the COVID-19 outbreak and incidents of exposure; this sets aside the requirement for patient consent [21]. Taken together, these provide the legal bases to link patient datasets on the Open-SAFELY platform. GP practices, from which the primary care data are obtained, are required to share relevant health information to support the public health response to the pandemic and have been informed of the OpenSAFELY analytics platform. This study was approved by the Health Research Authority (REC reference 20/LO/0651) and by the LSHTM Ethics Board (reference 21863).

Consent for publication
Not applicable